12 research outputs found

    High performance scientific computing in applications with direct finite element simulation

    Get PDF
    xiii, 133 p.La predicción del flujo separado, incluida la pérdida de un avión completo mediantela dinámica de fluidos computacional (CFD) se considera uno de los grandes desaf¿¿os que seresolverán en 2030, según NASA. Las ecuaciones no lineales de Navier-Stokes proporcionan laformulación matemática para flujo de fluidos en espacios tridimensionales. Sin embargo, todaviafaltan soluciones clásicas, existencia y singularidad. Ya que el cálculo de la fuerza bruta esintratable para realizar simulación predictiva para un avión completo, uno puede usar la simulaciónnumérica directa (DNS); sin embargo, prohibitivamente caro ya que necesita resolver laturbulencia a escala de magnitud Re power (9/4). Considerando otros métodos como el estad¿¿sticopromedio Reynolds¿s Average Navier Stokes (RANS), spatial average Large Eddy Simulation(LES), y Hybrid Detached Eddy Simulation (DES), que requieren menos cantidad de grados delibertad. Todos estos métodos deben ajustarse a los problemas de referencia y, además, cerca las paredes, la malla tieneque ser muy fina para resolver las capas l¿¿mite (lo cual significa que el costo computacional es muycostoso). Por encima de todo, los resultados son sensibles a, por ejemplo, parámetros expl¿¿citos enel método, la malla, etc.Como una solución al desaf¿¿o, aqu¿¿ presentamos la adaptación Metodolog¿¿a de solución directa deFEM (DFS) con resolución numérica disparo, como una familia predictiva, libre de parámetros demétodos para flujo turbulento. Resolvimos el modelo de avión JAXA Standard Model (JSM) ennúmero realista de Reynolds, presentado como parte del High Lift Taller de predicción 3.Predijimos un aumento de Cl dentro de un error de 5 % vs experimento, arrastre Cd dentro de 10 %error y detenga 1 ¿ dentro del ángulo de ataque.El taller identificó un probable experimento error depedido 10 % para los resultados de arrastre. La simulación es 10 veces más rápido y más barato encomparación con CFD tradicional o existente enfoques. La eficiencia proviene principalmente dell¿¿mite de deslizamiento condición que permite mallas gruesas cerca de las paredes, orientada aobjetivos control de error adaptativo que refina la malla solo donde es necesario y grandes pasos detiempo utilizando un método de iteración de punto fijo tipo Schur, sin comprometer la precisión delos resultados de la simulación.También presentamos una generalización de DFS a densidad variable y validado contra el problemade referencia MARIN bien establecido. los Los resultados muestran un buen acuerdo con losresultados experimentales en forma de sensores de presión. Más tarde, usamos esta metodolog¿¿apara resolver dos aplicaciones en problemas de flujo multifásico. Uno tiene que ver con un flashtanque de almacenamiento de agua de lluvia (consorcio de agua de Bilbao), y el segundo es sobre eldiseño de una boquilla para impresión 3D. En el agua de lluvia tanque de almacenamiento,predijimos que la altura del agua en el tanque tiene un influencia significativa sobre cómo secomporta el flujo aguas abajo de la puerta del tanque (válvula). Para la impresión 3D,desarrollamos un diseño eficiente con El flujo de chorro enfocado para evitar la oxidación y elcalentamiento en la punta del boquilla durante un proceso de fusión.Finalmente, presentamos aqu¿¿ el paralelismo en múltiples GPU y el incrustado sistema dearquitectura Kalray. Casi todas las supercomputadoras de hoy tienen arquitecturas heterogéneas,1 See the UNESCO Internacional Standard nomenclature for fields of Science and Technologyacomo CPU+GPU u otros aceleradores, y, por lo tanto, es esencial desarrollar marcoscomputacionales para aprovecha de ellos. Como lo hemos visto antes, se comienza a desarrollar eseCFD más tarde en la década de 1060 cuando podemos tener poder computacional, por lo tanto, Esesencial utilizar y probar estos aceleradores para los cálculos de CFD. Las GPU tienen unaarquitectura diferente en comparación con las CPU tradicionales. Técnicamente, la GPU tienemuchos núcleos en comparación con las CPU que hacen de la GPU una buena opción para elcómputo paralelo.Para múltiples GPU, desarrollamos un cálculo de plantilla, aplicado a simulación depliegues geológicos. Exploramos la computación de halo y utilizamos Secuencias CUDA paraoptimizar el tiempo de computación y comunicación. La ganancia de rendimiento resultante fue de23 % para cuatro GPU con arquitectura Fermi, y la mejora correspondiente obtenida en cuatro LasGPU Kepler fueron de 47 %.This research was carried out at the Basque Center for Applied Mathematics (BCAM) within the CFD Computational Technology (CFDCT) and also at the School of Electrical Engineering and Computer Science(Royal Institue of Technology, Stockholm, Sweden). Which is suported by Fundacion Obra Social “la Caixa“, Severo Ochoa Excellence research centre 2014-2018 SEV-2013-0323, Severo Ochoa Excellence research centre 2018-2022 SEV-2017-0718, BERC program 2014-2017, BERC program 2018-2021, MSO4SC European project, Elkartek. This work has been performed using the computing infrastructure from SNIC (Swedish National Infrastructure for Computing)

    Security in an evolving European HPC Ecosystem

    Get PDF
    The goal of this technical report is to analyse challenges and requirements related to security in the context of an evolving European HPC ecosystem, to provide selected strategies on how to address them, and to come up with a set of forward-looking recommendations. A key assumption made in this technical report is that we are in a transition period from a setup, where HPC resources are operated in a rather independent manner, to centres providing a variety of e-infrastructure services, which are not exclusively based on HPC resources and are increasingly part of federated infrastructures

    Hybrid CPU-GPU Parallel Simulations of 3D Front Propagation

    No full text
    This master thesis studies GPU-enabled parallel implementations of the 3D Parallel Marching Method (PMM). 3D PMM is aimed at solving the non-linear static Jacobi-Hamilton equations, which has real world applications such as in the study of geological foldings, where each layer of the Earth’s crust is considered as a front propagating over time. Using the parallel computer architectures, fast simulationscan be achieved, leading to less time consumption, quicker understanding of the inner Earth and enables early exploration of oil and gas reserves. Currently 3D PMM is implemented in shared memory architecture using OpenMP Application Programming Interface (API) and the MINT programming model, which translates C code into Compute Unified Device Architecture (CUDA) code for a single Graphical Process Unit (GPU). Parallel architectures have seen rapid growth in recent years, especially GPUs, allowing us to do faster simulations. In this thesis work, a new parallel implementation for 3D PMM has been done to exploit multicore CPU architectures as well as single and multiple GPUs. In a multiple GPU implementation, 3D data isdecomposed into 1D data for each GPU. CUDA streams are used to overlap the computation and communication within the single GPU. Part of the decomposed 3D volume data is kept in the respective GPU to avoid complete data transfer between the GPUs over a number of iterations. In total, there are two kinds of datatransfers that are involved while doing computation in the multiple GPUs: boundary value data transfer and decomposed 3D volume data transfer. The decomposed 3D volume data transfer is optimized between the multiple GPUs by using the peer to peer memory transfer in CUDA. The speedup is shown and compared between shared memory CPUs (E5-2660, 16cores), single GPU (GTX-590, C2050 and K20m) and multiple GPUs. Hand coded CUDA has shown slightly better performance than the Mint translated CUDA, and the multiple GPU implementation showed promising speedup compared to shared memory multicore CPUs and single GPU implementations

    Hybrid CPU-GPU Parallel Simulations of 3D Front Propagation

    No full text
    This master thesis studies GPU-enabled parallel implementations of the 3D Parallel Marching Method (PMM). 3D PMM is aimed at solving the non-linear static Jacobi-Hamilton equations, which has real world applications such as in the study of geological foldings, where each layer of the Earth’s crust is considered as a front propagating over time. Using the parallel computer architectures, fast simulationscan be achieved, leading to less time consumption, quicker understanding of the inner Earth and enables early exploration of oil and gas reserves. Currently 3D PMM is implemented in shared memory architecture using OpenMP Application Programming Interface (API) and the MINT programming model, which translates C code into Compute Unified Device Architecture (CUDA) code for a single Graphical Process Unit (GPU). Parallel architectures have seen rapid growth in recent years, especially GPUs, allowing us to do faster simulations. In this thesis work, a new parallel implementation for 3D PMM has been done to exploit multicore CPU architectures as well as single and multiple GPUs. In a multiple GPU implementation, 3D data isdecomposed into 1D data for each GPU. CUDA streams are used to overlap the computation and communication within the single GPU. Part of the decomposed 3D volume data is kept in the respective GPU to avoid complete data transfer between the GPUs over a number of iterations. In total, there are two kinds of datatransfers that are involved while doing computation in the multiple GPUs: boundary value data transfer and decomposed 3D volume data transfer. The decomposed 3D volume data transfer is optimized between the multiple GPUs by using the peer to peer memory transfer in CUDA. The speedup is shown and compared between shared memory CPUs (E5-2660, 16cores), single GPU (GTX-590, C2050 and K20m) and multiple GPUs. Hand coded CUDA has shown slightly better performance than the Mint translated CUDA, and the multiple GPU implementation showed promising speedup compared to shared memory multicore CPUs and single GPU implementations

    Multi-GPU Implementations of Parallel 3D Sweeping Algorithms with Application to Geological Folding

    Get PDF
    This paper studies the CUDA programming challenges with using multiple GPUs inside a single machine to carry out plane-by-plane updates in parallel 3D sweeping algorithms. In particular, care must be taken to mask the overhead of various data movements between the GPUs. Multiple OpenMP threads on the CPU side should be combined multiple CUDA streams per GPU to hide the data transfer cost related to the halo computation on each 2D plane. Moreover, the technique of peer-to-peer data motion can be used to reduce the impact of 3D volumetric data shuffles that have to be done between mandatory changes of the grid partitioning. We have investigated the performance improvement of 2-and 4-GPU implementations that are applicable to 3D anisotropic front propagation computations related to geological folding. In comparison with a straightforward multi-GPU implementation, the overall performance improvement due to masking of data movements on four GPUs of the Fermi architecture was 23%. The corresponding improvement obtained on four Kepler GPUs was 47%

    Edge Computing: An Overview of Framework and Applications

    Get PDF
    This report gives an overview of the Edge Computing paradigm and its applications. Indeed, with the advent of the Internet of Things (IoT) era, many electronic devices and sensors produce a vast volume of data which should be processed in a timely manner and this novel computing model is nowadays seen as a pertinent answer to this open challenge. This report thus explains why Edge Computing is needed and how the edge architecture is typically structured. It further presents the technologies that help this cutting-edge model to function properly. Since Edge Computing involves a heterogeneous architecture, it requires to adapt to a few technological recommendations for optimal performance. In this context, this report reviews the latest hardware technology trends tied to Edge Computing developments and points out technical challenges implementing this innovative computing model. In particular, we analyse how High-Performance Computing and CloudComputing infrastructures can be efficiently organised to design an Edge Computing-based framework able to tackle cutting-edge issues solved by Artificial Intelligence techniques. Finally, this report presents selected real-world applications of the Edge Computing paradigm across multiple domains affecting our daily life, i.e., healthcare, smart city and grids, industry 4.0 and public safet

    Towards HPC-Embedded. Case Study: Kalray and Message-Passing on NoC

    No full text
    Today one of the most important challenges in HPC is the development of computers with a low power consumption. In this context, recently, new embedded many-core systems have emerged. One of them is Kalray. Unlike other many-core architectures, Kalray is not a co-processor (self-hosted). One interesting feature of the Kalray architecture is the Network on Chip (NoC) connection. Habitually, the communication in many-core architectures is carried out via shared memory. However, in Kalray, the communication among processing elements can also be via Message-Passing on the NoC. One of the main motivations of this work is to present the main constraints to deal with the Kalray architecture. In particular, we focused on memory management and communication. We assess the use of NoC and shared memory on Kalray. Unlike shared memory, the implementation of Message-Passing on NoC is not transparent from programmer point of view. The synchronization among processing elements and NoC is other of the challenges to deal with in the Karlay processor. Although the synchronization using Message-Passing is more complex and consuming time than using shared memory, we obtain an overall speedup close to 6 when using Message-Passing on NoC with respect to the use of shared memory. Additionally, we have measured the power consumption of both approaches. Despite of being faster, the use of NoC presents a higher power consumption with respect to the approach that exploits shared memory. This additional consumption in Watts is about a 50%. However, the reduction in time by using NoC has an important impact on the overall power consumption as well.This research has been supported by EU-FET grant EUNISON 308874, the Basque Excellence Research Center (BERC 2014-2017) program by the Basque Government, the projects founded by the Spanish Ministry of Economy and Competitiveness (MINECO): BCAM Severo Ochoa accreditation SEV-2013-0323, MTM2013-40824 and TIN2015-65316-P. We acknowledge Research Center in Real-TIme & Embedded Computing Systes - CISTER for the provided resources

    RESIF 3.0: Toward a Flexible & Automated Management of User Software Environment on HPC facility

    Get PDF
    High Performance Computing (HPC) is increasingly identified as a strategic asset and enabler to accelerate the research and the business performed in all areas requiring intensive computing and large-scale Big Data analytic capabilities. The efficient exploitation of heterogeneous computing resources featuring different processor architectures and generations, coupled with the eventual presence of GPU accelerators, remains a challenge. The University of Luxembourg operates since 2007 a large academic HPC facility which remains one of the reference implementation within the country and offers a cutting-edge research infrastructure to Luxembourg public research. The HPC support team invests a significant amount of time (i.e., several months of effort per year) in providing a software environment optimised for hundreds of users, but the complexity of HPC software was quickly outpacing the capabilities of classical software management tools. Since 2014, our scientific software stack is generated and deployed in an automated and consistent way through the RESIF framework, a wrapper on top of Easybuild and Lmod [5] meant to efficiently handle user software generation. A large code refactoring was performed in 2017 to better handle different software sets and roles across multiple clusters, all piloted through a dedicated control repository. With the advent in 2020 of a new supercomputer featuring a different CPU architecture, and to mitigate the identified limitations of the existing framework, we report in this state-of-practice article RESIF 3.0, the latest iteration of our scientific software management suit now relying on streamline Easybuild. It permitted to reduce by around 90% the number of custom configurations previously enforced by specific Slurm and MPI settings, while sustaining optimised builds coexisting for different dimensions of CPU and GPU architectures. The workflow for contributing back to the Easybuild community was also automated and a current work in progress aims at drastically decrease the building time of a complete software set generation. Overall, most design choices for our wrapper have been motivated by several years of experience in addressing in a flexible and convenient way the heterogeneous needs inherent to an academic environment aiming for research excellence. As the code base is available publicly, and as we wish to transparently report also the pitfalls and difficulties met, this tool may thus help other HPC centres to consolidate their own software management stack

    A Multi-GPU Fast Iterative Method for Eikonal Equations using On-the-fly Adaptive Domain Decomposition

    No full text
    The recent research trend of Eikonal solver focuses on employing state-of-the-art parallel computing technology, such as GPUs. Even though there exists previous work on GPU-based parallel Eikonal solvers, only little research literature exists on the multi-GPU Eikonal solver due to its complication in data and work management. In this paper, we propose a novel onthe-fly, adaptive domain decomposition method for efficient implementation of the Block-based Fast Iterative Method on a multi-GPU system. The proposed method is based on dynamic domain decomposition so that the region to be processed by each GPU is determined on-the-fly when the solver is running. In addition, we propose an efficient domain assignment algorithm that minimizes communication overhead while maximizing load balancing between GPUs. The proposed method scales well, up to 6.17?? for eight GPUs, and can handle large computing problems that do not fit to limited GPU memory. We assess the parallel efficiency and runtime performance of the proposed method on various distance computation examples using up to eight GPUs
    corecore